Compute Capability (CC) acts as the versioning bridge between the virtual architecture (PTX) and the real architecture (SASS/binary). Developers use nvcc to target specific platforms, ranging from desktop and server GPUs to embedded platforms, across OS data models such as Linux 64-bit (LP64) and Windows 64-bit (LLP64).
1. Virtual vs. Real Architectures
The CUDA Toolkit supports offline compilation for GPU architectures spanning the last two major releases; Table 29: Feature Support across Compute Capabilities (7.5 to 12.x) lists the feature matrix. Virtual-to-real mappings are specified with flags such as nvcc --generate-code arch=compute_80,code=sm_90 prog.cu, which compiles compute_80 PTX down to sm_90 SASS. Shorthand targets like nvcc -arch=sm_100 are also available, as are architecture-specific variants such as nvcc -arch=sm_100a, which unlock features tied to that exact chip at the cost of forward compatibility.
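A common packaging pattern (a hedged Makefile sketch using nvcc's --generate-code syntax; the architecture choices are illustrative) embeds PTX for JIT-based forward compatibility alongside native SASS for known targets:

```
# Sketch: embed compute_80 PTX (JIT fallback on future GPUs)
# plus native SASS for sm_80 and sm_90 in one fatbinary.
NVCCFLAGS = --generate-code arch=compute_80,code=compute_80 \
            --generate-code arch=compute_80,code=sm_80 \
            --generate-code arch=compute_90,code=sm_90
```

At load time the driver picks the matching SASS if present, otherwise JIT-compiles the embedded PTX for the running GPU.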
2. The Macro Hierarchy
The compiler uses __CUDA_ARCH__ to branch code. The macro __CUDA_ARCH__ is defined only during device compilation passes (e.g., inside __device__ and __global__ functions), and its value encodes the target CC as major × 100 + minor × 10. More granular control is provided by __CUDA_ARCH_SPECIFIC__ and __CUDA_ARCH_FAMILY_SPECIFIC__. Certain features carry minimum CC requirements: Distributed Shared Memory, for example, requires Compute Capability 9.0 or later, while others (such as specific NaN payload behavior) are gated to Compute Capability 10.0 and later.
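The branching pattern above can be sketched as follows (a minimal illustration; the kernel name and fallback path are hypothetical):

```
__global__ void arch_probe(int *out) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // CC 9.0+ (Hopper and later) device pass: features such as
    // Distributed Shared Memory are legal to reference here.
    *out = __CUDA_ARCH__;   // e.g. 900, encoding major*100 + minor*10
#elif defined(__CUDA_ARCH__)
    // Older device pass: take a fallback path.
    *out = __CUDA_ARCH__;
#else
    // Host pass: __CUDA_ARCH__ is undefined, so guarded device-only
    // constructs must not leak into this branch.
#endif
}
```

Because nvcc compiles the same source once per target plus once for the host, every branch must be syntactically valid in every pass; the macro only selects which body survives preprocessing.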
3. Numerical Limits & Constraints
Precision guarantees vary by CC. On the host side, 80-bit extended precision (long double on LP64 Linux) has a minimum positive normal of $2^{-16382} \approx 3.36 \cdot 10^{-4932}$, with subnormal handling extending the representable range below that. Hardware and driver limits, such as the CUDA_DEVICE_MAX_COPY_CONNECTIONS environment variable (capped at 16) or the .maxnreg PTX directive, are strictly enforced based on the target CC version.